Sales prediction model for State of Connecticut Cannabis Retail Sales¶
This data set contains preliminary weekly retail sales data for cannabis and cannabis products in both the adult-use cannabis and medical marijuana markets. The data reported is compiled at specific points in time and only captures data current at the time the report is generated. The weekly data set captures retail cannabis sales from Sunday through Saturday of the week. Weeks spanning across two different months only include days within the same month. The first and last week of each month may show lower sales as they may not be made up of a full week (7 days). Data values may be updated and change over time as updates occur. Accordingly, weekly reported data may not exactly match annually reported data.
Source Data : https://catalog.data.gov/dataset/cannabis-retail-sales-by-week-ending
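The partial-week rule described above can be made concrete with a small sketch. It assumes the rule applies exactly as stated (weeks run Sunday through Saturday and are cut at month boundaries); the helper name `days_covered` and the `coverage` frame are hypothetical, and the three dates are taken from the first rows of the dataset.

```python
import pandas as pd

def days_covered(week_ending: pd.Timestamp) -> int:
    """Days a reported week covers under the Sunday-Saturday, same-month rule."""
    # weekday() gives Monday=0 ... Sunday=6; shift so that Sunday=0 ... Saturday=6.
    days_since_sunday = (week_ending.weekday() + 1) % 7
    # The period starts on the later of the preceding Sunday or the 1st of the month.
    return min(days_since_sunday + 1, week_ending.day)

dates = pd.to_datetime(["01/14/2023", "01/31/2023", "02/04/2023"], format="%m/%d/%Y")
coverage = pd.DataFrame({"Week Ending": dates})
coverage["Days Covered"] = [days_covered(d) for d in dates]
coverage["Partial Week"] = coverage["Days Covered"] < 7
print(coverage)
```

A flag like `Partial Week` could help explain the low-sales rows at the start and end of each month, though whether it improves a model would need to be checked.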
Creating a prediction model using Python and Pandas involves several steps, including data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation.
Step 1: Import Necessary Libraries¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import joblib
Step 2: Load the Dataset¶
Cannibas_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Cannibas Retail Sales\Machine Learning\Source Data\Cannabis_Retail_Sales_by_Week_Ending.csv")
# Display the first few rows of the dataset
print(Cannibas_Sales_df.head())
  Week Ending  Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0  01/14/2023              1485019.32                      1776700.69
1  01/21/2023              1487815.81                      2702525.61
2  01/28/2023              1553216.30                      2726237.56
3  01/31/2023               578840.62                       863287.86
4  02/04/2023              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price
0                            44.25                          36.23
1                            45.08                          34.89
2                            44.56                          35.65
3                            44.56                          35.93
4                            43.49                          34.84
Step 3: Data Preprocessing¶
Before building the model, you need to preprocess the data. This includes handling missing values, converting data types, and encoding categorical variables if necessary.
# Check for missing values
print(Cannibas_Sales_df.isnull().sum())
# Convert 'Week Ending' to datetime format
Cannibas_Sales_df['Week Ending'] = pd.to_datetime(Cannibas_Sales_df['Week Ending'])
# Extract year, month, and day from the date
Cannibas_Sales_df['Year'] = Cannibas_Sales_df['Week Ending'].dt.year
Cannibas_Sales_df['Month'] = Cannibas_Sales_df['Week Ending'].dt.month
Cannibas_Sales_df['Day'] = Cannibas_Sales_df['Week Ending'].dt.day
# Drop the original 'Week Ending' column
Cannibas_Sales_df.drop('Week Ending', axis=1, inplace=True)
# Display the first few rows after preprocessing
print(Cannibas_Sales_df.head())
Week Ending                          0
Adult-Use Retail Sales               0
Medical Marijuana Retail Sales       0
Total Adult-Use and Medical Sales    0
Adult-Use Products Sold              0
Medical Products Sold                0
Total Products Sold                  0
Adult-Use Average Product Price      0
Medical Average Product Price        0
dtype: int64

   Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0              1485019.32                      1776700.69
1              1487815.81                      2702525.61
2              1553216.30                      2726237.56
3               578840.62                       863287.86
4              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price  Year  \
0                            44.25                          36.23  2023
1                            45.08                          34.89  2023
2                            44.56                          35.65  2023
3                            44.56                          35.93  2023
4                            43.49                          34.84  2023

   Month  Day
0      1   14
1      1   21
2      1   28
3      1   31
4      2    4
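The date conversion in this step lets pandas infer the format. A slightly more defensive variant, sketched here on a made-up series (`raw` and `parsed` are hypothetical names, and the last value is deliberately invalid), states the `MM/DD/YYYY` format explicitly and turns unparseable entries into `NaT` so they can be counted rather than silently misread.

```python
import pandas as pd

# Two real-looking dates plus one bad value to show how errors surface.
raw = pd.Series(["01/14/2023", "01/21/2023", "not a date"], name="Week Ending")

# An explicit format avoids day/month ambiguity; errors="coerce" yields NaT on failure.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable rows:", int(parsed.isna().sum()))
```

In this dataset the missing-value check reports zero nulls, so the coercion would likely be a no-op; it mainly guards against future changes in the source file.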
Step 4: Exploratory Data Analysis (EDA)¶
Perform some basic EDA to understand the data distribution and relationships between variables.
# Summary statistics
print(Cannibas_Sales_df.describe())
# Correlation matrix
print(Cannibas_Sales_df.corr())
# Plotting the correlation matrix
import seaborn as sns
sns.heatmap(Cannibas_Sales_df.corr(), annot=True, cmap='coolwarm')
plt.show()
       Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
count            1.290000e+02                    1.290000e+02
mean             2.805301e+06                    1.777271e+06
std              1.119186e+06                    6.973442e+05
min              1.639950e+05                    6.283767e+04
25%              2.005884e+06                    1.458784e+06
50%              3.154663e+06                    1.818867e+06
75%              3.781082e+06                    2.365348e+06
max              4.495102e+06                    3.085787e+06

       Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
count                       1.290000e+02               129.000000
mean                        4.582549e+06             71854.674419
std                         1.560073e+06             30263.936939
min                         2.268327e+05              4188.000000
25%                         3.815815e+06             51174.000000
50%                         5.385123e+06             81333.000000
75%                         5.599181e+06             96544.000000
max                         7.290974e+06            120223.000000

       Medical Products Sold  Total Products Sold  \
count             129.000000           129.000000
mean            49059.937984        121017.155039
std             19173.146419         42855.211462
min              1916.000000          6104.000000
25%             41914.000000         96853.000000
50%             51266.000000        140225.000000
75%             62499.000000        148744.000000
max             86307.000000        199162.000000

       Adult-Use Average Product Price  Medical Average Product Price  \
count                       129.000000                     129.000000
mean                         39.163566                      35.965271
std                           1.661305                       1.734351
min                          35.550000                      32.800000
25%                          38.140000                      34.750000
50%                          39.080000                      35.650000
75%                          39.970000                      36.830000
max                          45.080000                      41.830000

              Year       Month         Day
count   129.000000  129.000000  129.000000
mean   2023.558140    6.325581   18.294574
std       0.571552    3.531472    9.731074
min    2023.000000    1.000000    1.000000
25%    2023.000000    3.000000   10.000000
50%    2024.000000    6.000000   19.000000
75%    2024.000000    9.000000   28.000000
max    2025.000000   12.000000   31.000000

                                   Adult-Use Retail Sales  \
Adult-Use Retail Sales                           1.000000
Medical Marijuana Retail Sales                   0.445148
Total Adult-Use and Medical Sales                0.916391
Adult-Use Products Sold                          0.985865
Medical Products Sold                            0.487862
Total Products Sold                              0.914167
Adult-Use Average Product Price                 -0.388460
Medical Average Product Price                   -0.291913
Year                                             0.396208
Month                                            0.279192
Day                                              0.026906

                                   Medical Marijuana Retail Sales  \
Adult-Use Retail Sales                                   0.445148
Medical Marijuana Retail Sales                           1.000000
Total Adult-Use and Medical Sales                        0.766368
Adult-Use Products Sold                                  0.418163
Medical Products Sold                                    0.987573
Total Products Sold                                      0.736314
Adult-Use Average Product Price                          0.252361
Medical Average Product Price                            0.253649
Year                                                    -0.423685
Month                                                   -0.096965
Day                                                      0.006971

                                   Total Adult-Use and Medical Sales  \
Adult-Use Retail Sales                                      0.916391
Medical Marijuana Retail Sales                              0.766368
Total Adult-Use and Medical Sales                           1.000000
Adult-Use Products Sold                                     0.894187
Medical Products Sold                                       0.791456
Total Products Sold                                         0.984970
Adult-Use Average Product Price                            -0.165875
Medical Average Product Price                              -0.096031
Year                                                        0.094840
Month                                                       0.156928
Day                                                         0.022443

                                   Adult-Use Products Sold  \
Adult-Use Retail Sales                            0.985865
Medical Marijuana Retail Sales                    0.418163
Total Adult-Use and Medical Sales                 0.894187
Adult-Use Products Sold                           1.000000
Medical Products Sold                             0.478523
Total Products Sold                               0.920014
Adult-Use Average Product Price                  -0.438852
Medical Average Product Price                    -0.303896
Year                                              0.393841
Month                                             0.296591
Day                                               0.020368

                                   Medical Products Sold  Total Products Sold  \
Adult-Use Retail Sales                          0.487862             0.914167
Medical Marijuana Retail Sales                  0.987573             0.736314
Total Adult-Use and Medical Sales               0.791456             0.984970
Adult-Use Products Sold                         0.478523             0.920014
Medical Products Sold                           1.000000             0.784075
Total Products Sold                             0.784075             1.000000
Adult-Use Average Product Price                 0.223924            -0.209915
Medical Average Product Price                   0.159102            -0.143934
Year                                           -0.364514             0.115209
Month                                          -0.100276             0.164167
Day                                            -0.002415             0.016054

                                   Adult-Use Average Product Price  \
Adult-Use Retail Sales                                   -0.388460
Medical Marijuana Retail Sales                            0.252361
Total Adult-Use and Medical Sales                        -0.165875
Adult-Use Products Sold                                  -0.438852
Medical Products Sold                                     0.223924
Total Products Sold                                      -0.209915
Adult-Use Average Product Price                           1.000000
Medical Average Product Price                             0.347670
Year                                                     -0.351466
Month                                                    -0.555304
Day                                                      -0.111863

                                   Medical Average Product Price      Year  \
Adult-Use Retail Sales                                 -0.291913  0.396208
Medical Marijuana Retail Sales                          0.253649 -0.423685
Total Adult-Use and Medical Sales                      -0.096031  0.094840
Adult-Use Products Sold                                -0.303896  0.393841
Medical Products Sold                                   0.159102 -0.364514
Total Products Sold                                    -0.143934  0.115209
Adult-Use Average Product Price                         0.347670 -0.351466
Medical Average Product Price                           1.000000 -0.619702
Year                                                   -0.619702  1.000000
Month                                                  -0.056917 -0.179758
Day                                                    -0.158841 -0.003103

                                      Month       Day
Adult-Use Retail Sales             0.279192  0.026906
Medical Marijuana Retail Sales    -0.096965  0.006971
Total Adult-Use and Medical Sales  0.156928  0.022443
Adult-Use Products Sold            0.296591  0.020368
Medical Products Sold             -0.100276 -0.002415
Total Products Sold                0.164167  0.016054
Adult-Use Average Product Price   -0.555304 -0.111863
Medical Average Product Price     -0.056917 -0.158841
Year                              -0.179758 -0.003103
Month                              1.000000 -0.010315
Day                               -0.010315  1.000000
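Reading an 11 by 11 correlation matrix by eye is error-prone, so it can help to list only the strongly correlated pairs (for instance, the matrix above shows 0.985865 between Adult-Use Retail Sales and Adult-Use Products Sold). The following is a minimal sketch on a toy frame; the helper `strong_pairs`, the 0.9 threshold, and the columns `a`, `b`, `c` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (if any survive stacking) fail the comparison and are dropped here.
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 5, 7, 9, 11],   # exactly 2 * a + 1, so perfectly correlated with a
    "c": [5, 1, 4, 2, 3],    # only weakly related to a and b
})
result = strong_pairs(toy)
print(result)
```

Applied to `Cannibas_Sales_df`, the same helper would be expected to surface the sales/products-sold pairs that the heatmap shows in dark red.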
Step 5: Feature Engineering¶
Feature engineering involves creating new features or transforming existing ones to improve the model's performance.
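# Create a new feature: total products sold per week (adult-use + medical)
# Note: in the rows shown, this equals the existing 'Total Products Sold' column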
Cannibas_Sales_df['Total Products Sold per Week'] = Cannibas_Sales_df['Adult-Use Products Sold'] + Cannibas_Sales_df['Medical Products Sold']
# Display the first few rows after feature engineering
print(Cannibas_Sales_df.head())
   Adult-Use Retail Sales  Medical Marijuana Retail Sales  \
0              1485019.32                      1776700.69
1              1487815.81                      2702525.61
2              1553216.30                      2726237.56
3               578840.62                       863287.86
4              1047436.20                      1971731.40

   Total Adult-Use and Medical Sales  Adult-Use Products Sold  \
0                         3261720.01                    33610
1                         4190341.42                    33005
2                         4279453.86                    34854
3                         1442128.48                    12990
4                         3019167.60                    24134

   Medical Products Sold  Total Products Sold  \
0                  49312                82922
1                  77461               110466
2                  76450               111304
3                  24023                37013
4                  56666                80800

   Adult-Use Average Product Price  Medical Average Product Price  Year  \
0                            44.25                          36.23  2023
1                            45.08                          34.89  2023
2                            44.56                          35.65  2023
3                            44.56                          35.93  2023
4                            43.49                          34.84  2023

   Month  Day  Total Products Sold per Week
0      1   14                         82922
1      1   21                        110466
2      1   28                        111304
3      1   31                         37013
4      2    4                         80800
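In the output above, the engineered column matches `Total Products Sold` row for row, so it adds no new information to the model. The sketch below first checks that on the five displayed rows, then shows two alternatives that do carry new information: the adult-use share of products sold and the previous week's total. Both alternatives are suggestions rather than part of the original notebook, and the names `sample`, `adult_share`, and `previous_total` are hypothetical.

```python
import pandas as pd

# The five rows printed by head() in this notebook.
sample = pd.DataFrame({
    "Adult-Use Products Sold": [33610, 33005, 34854, 12990, 24134],
    "Medical Products Sold": [49312, 77461, 76450, 24023, 56666],
    "Total Products Sold": [82922, 110466, 111304, 37013, 80800],
})

# Check whether the engineered sum merely duplicates an existing column.
engineered = sample["Adult-Use Products Sold"] + sample["Medical Products Sold"]
is_duplicate = bool((engineered == sample["Total Products Sold"]).all())
print("Engineered feature duplicates 'Total Products Sold':", is_duplicate)

# Alternative 1: share of products sold in the adult-use market.
adult_share = sample["Adult-Use Products Sold"] / sample["Total Products Sold"]

# Alternative 2: previous week's total (the first row has no predecessor, so it is NaN).
previous_total = sample["Total Products Sold"].shift(1)

print(adult_share.round(4))
print(previous_total)
```

Lagged features introduce a missing value in the first row, which would need to be dropped or filled before fitting a model.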
Step 6: Splitting the Data¶
Split the data into training and testing sets.
# Define features (X) and target (y)
X = Cannibas_Sales_df.drop(['Total Adult-Use and Medical Sales'], axis=1)
y = Cannibas_Sales_df['Total Adult-Use and Medical Sales']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
(103, 11) (26, 11)
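`train_test_split` shuffles rows by default, so the test set here mixes early and late weeks. For weekly sales data, a chronological holdout is a common alternative because it mimics predicting weeks that have not happened yet. The sketch below assumes a date column is still available (this notebook drops `Week Ending` in Step 3, so it would have to be kept or rebuilt); the helper `chronological_split` and the toy frame are hypothetical.

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, date_col: str, test_fraction: float = 0.2):
    """Hold out the most recent rows as the test set instead of a random sample."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]

# Ten consecutive Saturdays with made-up sales, shuffled to show that order is restored.
toy = pd.DataFrame({
    "Week Ending": pd.date_range("2023-01-07", periods=10, freq="W-SAT"),
    "Total Sales": [3.1, 3.3, 3.2, 3.6, 3.8, 4.0, 3.9, 4.2, 4.4, 4.5],
}).sample(frac=1, random_state=0)

train, test = chronological_split(toy, "Week Ending")
print(train.shape, test.shape)
print("Train ends:", train["Week Ending"].max().date())
print("Test starts:", test["Week Ending"].min().date())
```

Passing `shuffle=False` to `train_test_split` gives a similar effect when the rows are already in date order.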
Step 7: Model Selection and Training¶
Choose a model and train it on the training data. For simplicity, we'll use a Linear Regression model.
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
LinearRegression()
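One thing worth checking after training is what the model actually learned. The feature set from Step 6 keeps `Adult-Use Retail Sales` and `Medical Marijuana Retail Sales`, and in the rows displayed earlier their sum equals the target exactly, so a linear model can reproduce the target by simple addition. The sketch below illustrates this on the five displayed rows using only those two columns; `rows` and `demo_model` are hypothetical names, and this is a demonstration, not a refit of the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five rows printed by head() in this notebook.
rows = pd.DataFrame({
    "Adult-Use Retail Sales": [1485019.32, 1487815.81, 1553216.30, 578840.62, 1047436.20],
    "Medical Marijuana Retail Sales": [1776700.69, 2702525.61, 2726237.56, 863287.86, 1971731.40],
    "Total Adult-Use and Medical Sales": [3261720.01, 4190341.42, 4279453.86, 1442128.48, 3019167.60],
})

X_demo = rows[["Adult-Use Retail Sales", "Medical Marijuana Retail Sales"]]
y_demo = rows["Total Adult-Use and Medical Sales"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Coefficients near 1 and an intercept near 0 mean the model is just adding the columns.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients.round(4))
print("Intercept:", round(float(demo_model.intercept_), 4))
```

If the goal is to forecast sales that are not yet known, these two columns would not be available at prediction time, so dropping them (and the products-sold totals) from `X` may give a more honest picture of predictive skill.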
Step 8: Model Evaluation¶
Evaluate the model's performance on the test data.
# Make predictions
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
Mean Squared Error: 346692.3760245455
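MSE is expressed in squared dollars, which makes it hard to interpret: the value above corresponds to a root mean squared error of roughly 589 dollars. Reporting RMSE, MAE, and R² alongside MSE usually gives a clearer picture. The sketch below uses small made-up arrays (`y_true` and `y_hat` are hypothetical) so the arithmetic is easy to verify by hand; in the notebook the same calls would take `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values: errors are 0.5, 0.0, and -0.5.
y_true = np.array([3.0, 4.0, 5.0])
y_hat = np.array([2.5, 4.0, 5.5])

mse_demo = mean_squared_error(y_true, y_hat)
rmse_demo = float(np.sqrt(mse_demo))           # same units as the target
mae_demo = mean_absolute_error(y_true, y_hat)  # average absolute error
r2_demo = r2_score(y_true, y_hat)              # share of variance explained

print(f"MSE:  {mse_demo:.4f}")
print(f"RMSE: {rmse_demo:.4f}")
print(f"MAE:  {mae_demo:.4f}")
print(f"R^2:  {r2_demo:.4f}")
```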
Step 9: Making Predictions¶
You can now use the trained model to make predictions on new data.
# Example: Predict on new data
new_data = pd.DataFrame({
'Adult-Use Retail Sales': [1500000],
'Medical Marijuana Retail Sales': [1800000],
'Adult-Use Products Sold': [30000],
'Medical Products Sold': [50000],
'Total Products Sold': [80000],
'Adult-Use Average Product Price': [40],
'Medical Average Product Price': [35],
'Year': [2024],
'Month': [1],
'Day': [15],
'Total Products Sold per Week': [80000]
})
# Save the trained model to a file so it can be reloaded later
joblib.dump(model, 'cannabis_sales_model.pkl')
# Predict total sales for the new observation
predicted_sales = model.predict(new_data)
print(f'Predicted Total Sales: {predicted_sales[0]}')
Predicted Total Sales: 3300001.5161294458
Summary¶
Import Libraries: Import necessary libraries like Pandas, NumPy, and Scikit-learn.
Load Data: Load the dataset into a Pandas DataFrame.
Preprocess Data: Check for missing values, convert 'Week Ending' to datetime, and extract year, month, and day.
EDA: Perform exploratory data analysis to understand the data.
Feature Engineering: Create new features or transform existing ones.
Split Data: Split the data into training and testing sets.
Train Model: Choose a model and train it on the training data.
Evaluate Model: Evaluate the model's performance on the test data.
Make Predictions: Use the trained model to make predictions on new data.